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Abstract 

The vector autoregression (VAR) has long proven to be an effective method for modeling the joint dynamics 
of macroeconomic time series as well as forecasting. A major shortcoming of the VAR that has hindered 
its applicability is its heavy parameterization: the parameter space grows quadratically with the number of 
series included, quickly exhausting the available degrees of freedom. Consequently, forecasting using VARs 
is intractable for low-frequency, high-dimensional macroeconomic data. However, empirical evidence suggests 
that VARs that incorporate more component series tend to result in more accurate forecasts. Conventional 
methods that allow for the estimation of large VARs either tend to require ad hoc subjective specifications 
or are computationally infeasible. Moreover, as global economies become more intricately intertwined, there 
has been substantial interest in incorporating the impact of stochastic, unmodeled exogenous variables. Vector 
autoregression with exogenous variables (VARX) extends the VAR to allow for the inclusion of unmodeled 
variables, but it similarly faces dimensionality challenges. 

We introduce the VARX-L framework, a structured family of VARX models, and provide methodology that 
allows for both efficient estimation and accurate forecasting in high-dimensional analysis. VARX-L adapts several 
prominent scalar regression regularization techniques to a vector time series context in order to greatly reduce the 
parameter space of VAR and VARX models. We also highlight a compelling extension that allows for shrinking 
toward reference models, such as a vector random walk. We demonstrate the efficacy of VARX-L in both low- 
and high-dimensional macroeconomic forecasting applications and simulated data examples. Our methodology 
is easily reproducible in a publicly available R package. 
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1 Introduction 


The practice of macroeconomic forecasting was spearheaded by Klein and Goldberger (1955), whose eponymous 
simultaneous equation system jointly forecasted the behavior of 15 annual macroeconomic indicators, including 
consumer expenditures, interest rates, and corporate profits. The parameterization and identification restrictions of 
these models were heavily influenced by Keynesian economic theory. As computing power increased, such models 
became larger and began to utilize higher frequency data. Forecasts and simulations from these models were 
commonly used to inform government policymakers as to the overall state of the economy and to influence policy 


decisions (Welfe 2013). As the Klein-Goldberger model and its extensions were primarily motivated by Keynesian 
economic theory, the collapse of the Bretton Woods monetary system and severe oil price shocks led to widespread 


forecasting failure in the 1970s (Diebold 1998). At this time, the vector autoregression (VAR), popularized by Sims 


( 1980| , emerged as an atheoretical forecasting technique underpinned by statistical methodology and not subject 
to the ebb and flow of contemporary macroeconomic theory. 

Unfortunately, the VAR’s flexibility can create modeling complications. In the absence of prior information, 
the VAR assumes that every series interacts linearly with both its own past values as well as those of every other 
included series. Such a model is known as an unrestricted VAR. As most economic series are low-frequency (monthly, 
quarterly, or annual) there is rarely enough available data to accurately forecast using large unrestricted VARs. 
Such models are overparameterized, provide inaccurate forecasts, and are very sensitive to changes in economic 
variables. Gonsequently, in such applications, the VAR’s parameter space must be reduced, either in a data-driven 
manner or based upon the modeler’s knowledge of the underlying economic system. This model selection process 
has been described as “blending data and personal beliefs according to a subjective, undocumented procedure that 


others cannot duplicate” (Todd 1990)[p. 18]. 

Despite their overparameterization, in many applications, large VARs can be preferable to their smaller counterparts, 
as small models can exclude potentially relevant variables. Ideally, if one has no prior knowledge that a variable 


is irrelevant, it should be included in the model. For example, as described in Liitkepohl (2014), modeling the 


Taylor Rule (Taylor 1993) requires an estimate of the “output gap” between real Gross Domestic Product and 
potential output. The output gap is difficult to measure and can include many explanatory variables encompassing 


disaggregated economic measurements. Moreover, recent work by Ibarra (2012) and Hendry and Hubrich (2011) 
have shown that incorporating disaggregated series improves upon the forecasts of macroeconomic aggregates such 
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as the Consumer Price Index. Hence, in these scenarios, to properly utilize all relevant economic information, a 
large vector autoregression with a coherent variable selection procedure is required. 

Shortly after the VAR’s inception, efforts were made to develop a systematic approach to reduce its parameterization. 


Early attempts, such as Litterman (1979), centered upon a Bayesian approach underpinned by contemporary 
macroeconomic theory. In applying a Bayesian VAR with a Gaussian prior (analogous to ridge regression), specific 
priors were formulated based upon stylized facts regarding US macroeconomic data. For example, the popular 
Minnesota prior incorporates the prevailing belief that macroeconomic variables can be reasonably modeled by a 
univariate random walk via shrinking estimated models toward univariate unit root processes. 


The Bayesian VAR with a Minnesota prior was shown by Robertson and Tallman (1999) to produce forecasts 
superior to the conventional VAR, univariate models, and traditional simultaneous equation models. However, this 
approach is very restrictive; in particular, it assumes that all series are contemporaneously uncorrelated, and it 
requires the specification of several hyperparameters. Moreover, the Minnesota prior cannot accommodate large 


VARs itself. As pointed out by Banbura et al. (2009), when constructing a 40 variable system, in addition to 


the Minnesota prior, Litterman (1986b) imposes strict economically-motivated restrictions to limit the number of 
variables in each equation. 


Modern Bayesian VAR extensions originally proposed in Kadiyala and Karlsson (1997) and compiled by Koop 


( 2011| show that empirical regularization methods alone allow for the accurate forecasting of large VARs. Such 
procedures impose data-driven restrictions on the parameter space while allowing for more general covariance 
specifications and estimation of hyperparameters via empirical Bayes or Markov chain Monte Carlo methods. These 
approaches are computationally expensive, and multi-step forecasts are nonlinear and must be obtained by additional 


simulation. Using a conjugate Gaussian-Wishart prior, Banbura et al. (2009) extend the Minnesota prior to a 
high-dimensional setting with a closed-form posterior distribution. Their technique uses a single hyperparameter, 
which is estimated by cross-validation. However, their specification does not perform variable selection, and their 
penalty parameter selection procedure is more natural within a frequentist framework. 


More recent attempts to reduce the parameter space of VARs have incorporated the lasso (Tibshirani 1996), 


a least squares variable selection technique. These approaches include the lasso- VAR proposed by |Hsu et "ah 


(2008) and further explored by Song and Bickel (2011), Li and Chen (2014), and Davis et al. (2015). Theoretical 


properties were investigated by Kock and Callot (2015) and by Basu and Michailidis (2015). Gefang (2012) considers 
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a Bayesian implementation of the elastic net, an extension of the lasso proposed by Zou and Hastie (2005) that 
accounts for highly correlated covariates. However, their implementation is not computationally tractable and they 
do not observe much of a forecasting improvement over existing methods. The lasso-VAR has several advantages 
over Bayesian approaches as it is more computationally efficient in high dimensions, performs variable selection, 
and can readily compute multi-step forecasts and their associated prediction intervals. 

In many applications, a VAR’s forecasts can be improved by incorporating unmodeled exogenous variables, 
which are determined outside of the VAR. Examples of exogenous variables are context-dependent and range from 
leading indicators to weather-related measurements. In many scenarios, global macroeconomic variables, such as 
world oil prices, may be considered exogenous. Such models are most commonly referred to as “VARX” in the 
econometrics literature, but they are also known as “transfer function” or “distributed lag” models. 

VARX has become an especially popular approach in the modeling of small open economies, as they are generally 
sensitive to a wide variety of global macroeconomic variables which evolve independently of their internal indicators. 


For example Cushman and Zha (1997) use a structural VARX model to analyze the effect of monetary policy shocks 
in Canada. The VARX is also amenable under scenarios in which forecasts are desired only from a subset of the 
included series in a VAR, as by construction its corresponding VARX has a reduced parameterization. VARX 


models have received considerable attention not just in economics, but also marketing (Nijs et al. 2007), political 
science (Wood, 20091, and real estate (Brooks and Tsolacos 2000| . 

Unfortunately, dimensionality issues have limited the utility of the VARX. As a result of the aforementioned 
overparameterization concerns, in the conventional unrestricted VAR context most applications are limited to no 


more than 6 series (Bernanke et al. 2005), forcing the practitioner to specify a priori a reduced subset of series 


to include. The VARX faces similar restrictions. As outlined in Penm et al. (1993), lag order, the maximum 
number of lagged observations to include, may differ between modeled and unmodeled series. Hence, in order to 
select a VARX model using standard information-criterion minimization based methods, one must fit all subset 
models up to the predetermined maximal lag order for both the series of forecasting interest (which we refer to 
as endogenous throughout this paper) and the exogenous series. Moreover, unlike the conventional VAR, zero 
constraints (restrictions fixing certain model parameters to zero) are generally expected. 

As it is often viewed as an economic rather than statistical problem, reducing the parameter space of the VARX 


model has received considerably less attention. Ocampo and Rodriguez (2012) extend the aforementioned Bayesian 
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VAR estimation methods to the VARX context. George et al. (2008) apply stochastic search variable selection to 
the VARX framework; it provides a data-driven method to determine zero restrictions, but their approach is not 


scalable to high dimensions. Chiuso and Pillonetto (2010) propose estimating a VARX model with lasso and group 
lasso penalties but do not elaborate on potential group structures. 

This paper seeks to bridge the considerable gap between the regularization and macroeconomic forecasting 
communities. We develop the VARX-L framework which allows for high-dimensional penalized VARX estimation 
while incorporating the unique structure of the VARX model in a computationally efficient manner. In order to 
implement this framework, we develop substantial modifications to existing lasso and group lasso solution algorithms, 
which were designed primarily for univariate regression applications with no time dependence. 

We extend the lasso and its structured counterparts to impose structured sparsity on the VARX, taking into 
account characteristics such as lag coefficient matrices, the delineation between a component’s own lags and 
those of another component, and a potential nested structure between endogenous and exogenous variables. Our 
methods offer great flexibility in capturing the underlying dynamics of an economic system while imposing minimal 
assumptions on the parameter space. 

Moreover, unlike previous approaches, due to our adaptation of convex optimization algorithms to a multivariate 
time series setting, our models are well-suited for the simultaneous forecasting of high-dimensional low-frequency 
macroeconomic time series. In particular, our models allow for prediction under scenarios in which the number of 
component series and included exogenous variables is close to or exceeds the length of the series. Our procedures, 
which avoid the use of subjective or complex hyperparameters, are publicly available in our R package BigVAR and 
can be easily applied by practitioners. 

Sectionj^describes the notation used throughout the paper and introduces our structured regularization methodology. 

Section provides our implementation details and presents three macroeconomic forecasting applications. 
Sectionj^details the “Minnesota VARX-L,” an extension that allows for the incorporation of unit root nonstationarity 
by shrinking toward a vector random walk, section presents a simulation study, and section contains our 
conclusion. The appendix details the solution strategies and algorithms that comprise the VARX-L class of models. 


2 Methodology 

A fc-dimensional multivariate time series {yt}J^i and m-dimensional exogenous multivariate time series 
follow a vector autoregression with exogenous variables of order {p,s), denoted YARXk,m{p, s), if the following 
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linear relationship holds (conditional on initialization): 


p s 

yt = t^ + '^ + Ut for t = 1,..., T, 

i=i j=i 


( 2 . 1 ) 


in which ly denotes a fc-dimensional constant intercept vector, represents a k x k endogenous coefficient matrix 
at lag £ = 1,... ,p, represents a k x m exogenous coefficient matrix at lag j = 1,..., s, and Uj denotes a 
/c-dimensional white noise vector that is independent and identically distributed with mean zero and nonsingular 


covariance matrix A VAR, which is a special case of the VARX, can be represented by Equation (2.1) with 
= 0 for j = 1,..., s. 

In a low-dimensional setting, in which the number of included predictors is substantially smaller than the length 
of the series, T, the VARX model can be fit by multivariate least squares, with !><,$ = [$^^\..., and 

(3 = [I3^^\ ■ ■. estimated as 


T 

argmin ||yt 

v.^,£3 


t=i 


_ ly _ ^ ^ (3''^'^yLt_j\ 


2 


( 2 . 2 ) 


t=i 


i=i 


in which ||A|jj’ = j' denotes the Frobenius norm of a matrix A, which reduces to the L 2 norm when A is 
a vector. In the absence of regularization, the VARXj, „j(p, s) requires the estimation of k(l kp + ms) regression 


parameters. In the following section, we will apply several convex penalties to Equation (2.21 which aid in reducing 
the parameter space of the VARX by imposing sparsity in $ and /3. 

2.1 VARX-L: Structured Penalties for VARX Modeling 

In this section, we introduce VARX-L, a general penalized multivariate regression framework for large VARX 
models. We consider structured objectives of the form 


T p s 

mn V l|yt + ^( ^y(^) + '^x{f3) ), 


e=i 




(2.3) 


in which A > 0 is a penalty parameter, which is selected in a sequential, rolling manner according to a procedure 


that is discussed in section 3.1 Vy{^) denotes a penalty function on endogenous coefficients, and 'Px{f3) denotes a 


penalty function on exogenous coefficients. 

Table [2 details the penalty structures proposed in this paper; all but the last have this separable structure. In 
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Tclblc 1' The Proposed VARX-L Penalty Functions. Note that and denote the diagonal and ofF-diagonal elements of coefficient matrix 

respectively. 


Group Name 

■PyW 

Px(/3) 

(2.4) Lag 

(2.5) Own/Other 

(2.6) Sparse Lag 

(2.7) Sparse Own/Other 

(2.8) Basic 

ll*olblF + \/fc(fc-i)ELi 
(1 - +«ll*lll 

(1 -o)(\/fcELl ll*J.n’llF+Vfc(fc-l)ELl ll*rftllF) +c||*||l 

(i-c«)v^e”=iE’Li1I/9“^iif +^\mi 
ll/3||l 

(2.9) Endogenous-First 

r’a.x(4',/3) = ELi VLl (lll'*’iv 

If + ii0)Vif) 


the following section, we will discuss each penalty structure in detail. Note that since we utilize a single penalty 
parameter for all model coefficients, it is required that all included series are on the same scale; hence we assume 
that prior to estimation, the series are standardized to each have zero mean and unit variance. 


Equations (2.4)-(2.5) adapt the group lasso penalty proposed by Yuan and Lin (2006) to the VARX setting. The 
group lasso partitions the parameter space into groups of related variables which are shrunk toward zero. Within a 
group, all variables are either identically set to zero or are all nonzero. Our choices of Vy and Vx create structured 
sparsity based on pre-specified groupings, which are designed to incorporate the intrinsic lagged structure of the 
VARX. The proposed “lag group” methods have a substantial advantage over popular Bayesian approaches in that 
they will both shrink least squares estimates toward zero as well as perform variable selection in a computationally 
efficient manner. 

Sparsity in the coefficient matrix is desirable when k and m are large because the conventional VARX is grossly 


overparameterized. As stated in Litterman (1984), it is widely believed in macroeconomic forecasting that small 
bits of relevant information exist throughout the data, and economic theory is not informative with regard to the 
manner in which this information is scattered. The proposed VARX-L framework provides a systematic approach 
to filter as much information as possible, assigning each bit an appropriate weight. 


The group lasso penalty function was explored in the VAR context by Song and Bickel (2011) who consider 
several structured groupings with a particular emphasis on creating a distinction between a series’ own lags and 
those of another series. Theoretical properties of the use of a group lasso penalty in the VAR setting were further 


explored by Basu et al. (2015). 


A feature of the Lag Group VARX-L is that it does not impose sparsity within a group. Song and Bickel (2011) 
attempt to circumvent this constraint by including several additional lasso penalties, but such an approach requires 
a multi-dimensional gridsearch to select penalty parameters. The penalties for the proposed Sparse Group VARX-L 


and Sparse Own/Other Group VARX-L, listed in Equations (2.6)-(2.7), instead implement the sparse group lasso 
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Figure 1: Example sparsity pattern (active elements shaded) produced by a Lag Group VARX-L 3^2 (5, 3) 



t.(2) 





(Simon et al. 2013), which extends the group lasso to allow within-group sparsity. The sparse group lasso can be 


viewed analogously to the elastic net (Zou and Hastie 2005) extended to structured penalties. 


The penalty for the Basic VARX-L adapts the lasso (2.8); it considers no structure, or can be viewed as a 


group lasso penalty that assigns each coefficient to a singleton group. In very high-dimensional scenarios, this most 
basic penalty has computational advantages as compared to more complex approaches. Finally, the penalty for the 


proposed Endogenous-First VARX-L, Equation (2.9), incorporates a nested penalty structure such that, within a 


lag, endogenous coefficients are prioritized before their exogenous counterparts. Since this penalty structure is not 


separable in the manner of Equation (2.3), its penalty function is denoted as Vy^x- 

Group VARX-L 


We first present the Lag Group VARX-L (2.4), in which the endogenous coefficients are grouped according to 
their lagged coefficient matrix for £ = 1,... ,p, and at every lag, each exogenous component series is partitioned 
into its own group. This structured grouping is expressed as 




W\ 


e=i 


j=i i=i 


U)i 


Note that since the endogenous and exogenous groups differ in cardinality, it is required to weight the penalty to 
avoid regularization favoring larger groups. This structure implies that for each endogenous series, a coefficient 
matrix at lag £ is either entirely nonzero or entirely zero. Similarly, the relationship between an exogenous and 
endogenous series at lag j will either be nonzero for all endogenous series or identically zero. A potential sparsity 
pattern generated by this structure (with fc = 3, p = 5, m = 2, and s = 3) is shown in Eigure[^with the active (i.e. 
nonzero) elements shaded. 


In comparison to Bayesian regularization methods, such as stochastic search variable selection (George et al. 


2008), estimating the Lag Group VARX-L is tractable even in high dimensions. We are able to extend the efficient 


group lasso solution method proposed by Qin et al. (2010), who utilize a block coordinate descent procedure 


and transform each “one group” subproblem to a trust-region framework. These subproblems can then be solved 







































Figure 2: Example sparsity pattern (active elements shaded) produced by an Own/Other Group VARX-L3^2(5, 3) 



efficiently via univariate optimization. Details of our algorithm are provided in section |A.3.2| of the appendix. 

The Lag Group structure is advantageous for applications in which all series tend to exhibit comparable 
dynamics, such as forecasting the disaggregate subcomponents of a composite index. It also can serve as a powerful 
tool for lag selection. However, in many settings, it may not be appropriate to give equal consideration to every 
entry in a coefficient matrix. Diagonal entries of each which represent regression on a series’ own lags, are in 
many applications more likely to be nonzero than are off-diagonal entries, which represent lagged cross-dependence 


with other components. The Own/Other Group VARX-L (2.5) allows for the partitioning of each lag matrix 


into separate groups by assigning the endogenous penalty structure 


Vy{^) = VkY, + Vk(k^)j2 

e=i e=i 

in which denotes the diagonal elements of and denotes its off-diagonal entries. 

An example of this sparsity pattern is shown in Figure The computational modifications required to utilize 
the Own/Other structure are detailed in section of the appendix. This delineation between own lags and 

other lags is often incorporated in macroeconomic forecasting. As detailed in Litterman ( |1986a| , the traditional 
Minnesota prior operates under the assumption that a series’ own past values account for most of its variation, 
hence they are shrunk by a smaller factor than realizations of other series. The strong forecasting performance 


of the VARX-L procedures that utilize the Own/Other structure in section 3.3 provides further justification for 
Litterman’s beliefs. 

In addition to the Own/Other and Lag group structures, one could consider a variety of application-dependent 


partitions. This idea was briefly explored in Song and Bickel (2011) who posit potentially segmenting financial or 


economic series based upon industry or sector characteristics in addition to a lag based grouping. However, since in 
many applications we lack a priori information about the potential relationships of the included series, we prefer 
to structure our groupings around the intrinsic structure of the VARX. 
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Figure 3: Example sparsity pattern (active elements shaded) produced by a Sparse Lag Group VARX-L 3,2 (5, 3) 
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Sparse Group VARX-L 

For certain applications, a group penalty might be too restrictive. If a group is active, all coefficients in the 
group will be nonzero, and including a large number of groups substantially increases computation time. Moreover, 
it is inefficient to include an entire group if, for example, only one coefficient is truly nonzero. The sparse group 


lasso, proposed by Simon et al. (2013) allows for within-group sparsity through a convex combination of lasso and 


group lasso penalties. The Sparse Lag Group VARX-L (2.6) results in a penalty of the form 


Vyi^) = (1 - = (1 - 

j=ii=i 

in which 0 < a < 1 is an additional penalty parameter that controls within-group sparsity. Without prior knowledge 
of the important predictors, we weight according to relative group sizes and set a = though a could also be 
estimated by cross-validation. 

The inclusion of the Li norm allows for within-group sparsity, hence even if a group is considered active individual 
coefficients within it can be set to zero. An example sparsity pattern is depicted in Figure]^ 

Since the inclusion of within-group sparsity does not create a separable objective function, conventional group 


lasso solution methods, such as coordinate descent, are no longer applicable. Following Simon et al. (2013), our 


estimation algorithm for the Sparse Lag Group VARX-L makes use of proximal gradient descent. The details of this 
approach and our implementation are provided in section |A.3.4| of the appendix. This penalty can be extended to 


alternative groupings. Gonsequently, we also consider the Sparse Own/Other Group VARX-L (2.7) as an estimation 
procedure. 
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Figure 4: Example sparsity pattern (active elements shaded) produced by a Basic VARX-L3,2(5, 3) 
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Basic VARX-L 


The Basic VARX-L (2.8|, proposed by Chiuso and Pillonetto (2010), incorporates no structure and can be 


viewed as a special case of the Sparse Lag Group VARX-L in which a = 1, resulting in penalties of the form 

p,(#) = ii$iii, v.{(3) = mw. 

The Li penalty will induce sparsity in the coefficient matrices $ and /3 by zeroing individual entries. An example 
sparsity pattern is depicted in Figure]^ 

A major advantage of the Basic VARX-L over its structured counterparts is its computational tractability. 
Our solution approach involves the use of coordinate descent, popularized for lasso problems by [Friedman et al. 
( 2010| . Coordinate descent consists of partitioning the Basic VARX-L into subproblems for each scalar element 
[#,/3]y, solving component-wise, then updating until convergence. This approach is computationally efficient since. 


in the Basic VARX-L context, each subproblem has a closed-form solution. Tseng (2001) establishes that global 


convergence arises from solving individual subproblems in the coordinate descent framework. Our solution strategy 
and algorithm are detailed section |A.3.1| of the appendix. 

2.2 An Endogenous-First Active Set 

We have previously only considered structures that assign endogenous and exogenous variables to separate 
groups. In this section, we consider a nested structure that can take into account the relative importance between 
endogenous and exogenous predictor series. 

In certain scenarios, there may exist an a priori importance ranking among endogenous and exogenous variables. 
For example, the endogenous variables could represent economic indicators of interest in a small open economy, 
with global macroeconomic indicators serving as exogenous variables. In such a scenario, it may be desirable for 
exogenous variables to enter into a forecasting equation only if endogenous variables are also present at a given lag 


£. We can consider such a structure by utilizing a hierarchical group lasso penalty (see, e.g. Jenatton et al. (2011)). 
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Figure 5: Example sparsity pattern (active elements shaded) generated by an Endogenous-First VARX-L3^2(4,4). Note that a 
row in ( 3 ^^^ can only be nonzero if the corresponding row in is also nonzero. 


The Endogenous-First VARX-L penalty function (2.9l takes the form 


v(^,^) = EE 

fci j=i 




+ 11/3 


f\ 


( 2 . 10 ) 


Under this structure, at a given lag, exogenous variables can enter the model only after the endogenous variables 


at the same lag. Note that this structure requires that s < p. It should also be noted that (2.10) decouples across 


rows, allowing for separate nested structnres across each endogenous series. This sparsity pattern is depicted in 
Figure 

Most group lasso solution methods, such as block coordinate descent, take advantage of the separability of 
groups to improve computational performance. Although the nested structure is not directly separable, based on 


the methodology of Jenatton et al. (2011), its dual can be solved in one pass of block coordinate descent. Details 


of the solution approach and our algorithm are provided in section A.3.5 of the appendix. 


3 High-Dimensional Macroeconometrics 

In this section, we evaluate our regularization procedures in three macroeconomic data applications: two 
high-dimensional and one low-dimensional. In our first two applications, we consider applying the proposed 
VARX-L procedures on the widely used set of US macroeconomic indicators originally constructed by [Stock and] 


Watson (2005). Our second example considers forecasting a small set of Canadian macroeconomic indicators and 


incorporating the previous US data as exogenous series. Section |3.1| outlines the practical implementation of our 


penalty parameter selection procedure, section 3.2 describes the benchmarks that we compare our models against, 
section [T3| details our macroeconomic applications, and section [Td] provides an expanded comparison of the relative 
forecasting performance among the competing models using the model confidence sets framework developed by 


Hansen et al. (2011). 


12 










































3.1 Practical Implementation 


The regularization parameter, A, is not known in practice and is typically chosen via cross-validation. In this 
section, we detail our strategy for selecting A. Following Friedman et al. ( 2010| , we select from a grid of potential 
penalty parameters that starts with the smallest value in which all components of [^,/3] will be zero and decreases 
in log-linear increments. This value differs for each procedure and can be inferred by their respective algorithms. 
The starting values are summarized in Table located in section |A.5| of the appendix. The number of gridpoints, 
N, as well as the depth of the grid are left to user input. A deep grid and large number of gridpoints result in 
increased computational costs and often do not improve forecasting performance. We have found that a grid depth 
^Amax and 10 gridpoints achieve adequate forecast performance in most scenarios. 

Due to time-dependence, our problem is not well-suited to traditional AT-fold cross-validation. Instead, following 
Banbura et al.| (2009), we propose choosing the optimal penalty parameter by minimizing h-step ahead mean-square 


forecast error (MSFE), in which h = 1,2, 3,... denotes the desired forecast horizon. We divide the data into three 
periods: one for initialization, one for training, and one for forecast evaluation. Define time indices Ti = ,T 2 = 


LfJ- 

We start our validation process by fitting a model using all data up to time Ti and forecast for i = 1,..., N. 

We then sequentially add one observation at a time and repeat this process until time T 2 — h. This procedure is 
illustrated in Figure 




1 


Initialization 


Penalty Parameter Selection 

-> 


Forecast Evaluation 

<-> 


H 


T, 


T2 


T 


Figure 6: Illustration of Rolling Cross-Validation 


We select A as the minimizer of 

MSFE{X,)= / e \\yth-y^+>^\\F- (3-1) 

Finally, we evaluate the h-step ahead forecast accuracy of A from time origins t = T 2 — h to t = T — h. If desired, 
additional criterion functions can be substituted. MSFE is the most natural criterion given our use of the least 
squares objective function. Rather than parallelizing the cross-validation procedure, our approach uses the result 
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from the previous period as an initialization or “warm start,” which substantially decreases computation time. The 


penalty parameter selection procedure is presented in Algorithm in the appendix. 

Multi-step Predictions 

The VARX-L framework can easily accommodate multi-step ahead forecasting. To do so, we modify our 
algorithms to calculate direct multi-step ahead forecasts. Essentially, computing direct h-step forecasts 


solving the standard VARX-L objective (2.31 while leaving a gap of h observations. 


solution 

involves 


T p 


^+ A P,($) + P,(/3) 


Per Clark and McCracken (2013), the direct h-step ahead forecast can be calculated as 


^( 1 ) ^(p) ^( 1 ) 

yt+h = £' + ^ y* + • • • + ^ yt-p+i+P ^t + ''' + /3 xt-s-i-i- 


The iterative approach, in which multi-step forecasts are computed recursively as 1—step ahead forecasts using 
predicted values is another popular technique to compute long-horizon forecasts. This approach is not directly 
extendable to the VARX setting, as we do not forecast the exogenous series. 

If iterative multi-step predictions are desired, one could instead fit the full VAR^+m. However, as shown in 


Marcellino et al. (2006), direct forecasts are more robust to model misspecification, making them more appropriate 


in high-dimensional settings. As our macroeconomic applications consider quarterly data, in section [3i3| we compute 
both 1—step and 4—step ahead forecasts. 

Selecting a Structure 

The VARX-L framework offers many choices for structured penalties suited toward a wide range of applications. 
Under scenarios in which little is known about the potential dynamic dependence of the included series, the Basic 
VARX-L makes no underlying structural assumptions. 

The Lag Group and Sparse Lag Group VARX-L structures are most appropriate when the endogenous series 
are closely related; for example, if the series of interest comprise unemployment rates segmented by state or 
census region. The Own/Other Group and Sparse Own/Other Group VARX-L structures are most appropriate 
for macroeconomic applications in which a series’ own lags are thought to have substantially different temporal 
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dependence than those of “other” series. As an example, one could consider a disparate group of series traditionally 
examined in small-scale macroeconometric forecasting applications: the US Federal Funds Rate, the GDP Growth 
Rate, and the Consumer Price Index. The Endogenous-First structure is best suited toward applications in which 
the forecasting effectiveness of the exogenous series is unknown. 

One potential diagnostic tool involves fitting the Sparse Lag Group VARX-L with both A and a selected according 
to rolling cross validation. A selected value of a close to 0 indicates evidence of strong groupwise sparsity, while a 
value close to 1 indicates unstructured sparsity, and values in the middle provide evidence for some combination of 
the two. 

Since the computational time required to apply all procedures is manageable, in practice, we suggest fitting 
several VARX-L structures and selecting the approach that achieves the best out of sample forecasting performance. 


3.2 Methods for Comparison 

A conventional VARX model selection approach in a low-dimensional setting involves fitting a VARXfc_m(€,j) 
by least squares for 0 < £ < p, 0 < j < s and selecting lag orders for both the endogenous and exogenous series 
based on an information criterion, such as Akaike’s Information Griterion (AIG) or Bayesian Information Griterion 


(BIG). Per Liitkepohl (2005), the AIG and BIG of a VARXfe_m(£, j) are defined as 


AIC(£,j) = log|El'^| + 
BIC(£,j) = log|S^’^| + 


2{k{ki + mj)) 


T 


log(T)(fc(fc£ -I- mj)) 


T 


in which jg residual sample covariance matrix obtained from the estimated YARXk^m{^, j), and |S| represents 
the determinant of E. The selected lag orders (£, j) are then chosen as the minimizer of AIG or BIG. AIG penalizes 
each model coefficient uniformly by a factor of two whereas BIG scales penalties relative to series length. Hence, 
when T is large, BIG will tend to select more parsimonious models than AIG. 

We compare our methods against least squares model selection procedures that utilize AIG and BIG to select lag 
orders. Since we are considering high-dimensional applications, in which could be ill-conditioned, we construct 
our least squares estimates using a variation of the approach developed by Neumaier and Schneider] (2001). This 
procedure constructs the least squares estimates using a QR decomposition, which obviates the need for explicit 
matrix inversion. In addition, following a heuristic proposed by 


Hansen 


(1998), we impose a ridge penalty: ({k-i + 


m ■ j)^ + (k ■ £ + m ■ j) + l)eniachine scaled by the column norms of the lagged series yt-i, ■ ■■, Yt-p, xt_i, ..., xt_s. 
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in which Cniachine denotcs machine precision. This penalty ensures that the determinant of is well-defined 
without noticeably impacting degree of freedom calculations. However, in very high-dimensional settings, where 
k{kp -\- ms) > kT, the covariance S„ is not well defined, hence information criterion based methods are not 


appropriate. The specifics of our information criterion minimization routine are detailed in Nicholson et al. (2016). 

We additionally compare our methods against two naive approaches that provide insight with regard to the 
level of temporal dependence in the data. We first consider the unconditional sample mean, which will make h-step 
ahead forecasts at time t + h based upon the average of all observed data up to time t: jt+h = \ Vi- Strong 
relative performance of the sample mean indicates that the sophisticated models of interest have low predictive 
power relative to a white noise process; the cause of which could be a number of factors, including weak temporal 
dependence or dependence of a nonlinear nature. 

Second, we consider the vector random walk model, which makes /i-step ahead forecasts based upon the most 
recent realization of the series, i.e. yt+h = Yt- Superior performance of the vector random walk indicates high 
persistence or a strong degree of temporal dependence, as is often observed in macroeconomic data. 

Finally, we compare against two more sophisticated approaches. First, we consider the popular Bayesian VAR 


with a modihed Minnesota Prior proposed by Banbura et al. (2009) (henceforth BGR). Their approach acts very 
similarly to ridge regression in that it shrinks least squares coefficients toward zero with the degree of regularization 
determined by a single penalty parameter. This parameter is chosen according to rolling cross validation as described 
in section 


3.1 


As in Banbura et al. (2009), instead of fitting a VARXfc_m(p, s), we fit a VARfe+m(p) and select the 


regularization parameter as the minimizer of ft.—step ahead MSFE across the k endogenous series. This allows 
BGR’s method to make forecasts utilizing information from both the endogenous and exogenous series, and creates 
a direct comparison with our VARX-L framework. 

BGR’s approach modifies the Minnesota Prior to make it computationally tractable in high dimensions, but 
it does not does not return sparse solutions. Superior performance of the VARX-L methods relative to BGR’s 
approach provides evidence as to the importance of imposing sparsity in obtaining accurate forecasts. Details of 
our implementation of BGR’s procedure are provided in section |A.4| in the appendix. 


In addition, we compare against a factor model (Anderson 1984 Stock and Watson, 2006). In a manner similar 
to principal components analysis, a factor model attempts to find a low-rank structure in the data that adequately 
accounts for a high percentage of the variation across the series. Our implementation forecasts using a reduced-rank 
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structure that accounts for at least 95 percent of the total variance of the included series. For implementation details, 


consult Ensor (2013). We expect that the factor model will perform well under scenarios in which a small subset of 


common factors drive the underlying dynamics of the larger cross-section of modeled series. 

3.3 Macroeconometric Applications 

We evaluate our methods on the large and widely utilized macroeconomic dataset created by [Stock and Watson] 


(2005) and later amended by Koop (2011). The dataset consists of 168 quarterly US macroeconomic indicators 


containing information about various aspects of the economy, including income, industrial production, employment, 
stock prices, interest rates, exchange rates, etc. The data ranges from Quarter 2 of 1959 to Quarter 3 of 2007 


(T = 195). Per Koop (2011), the series can be categorized into several levels; we consider the following four: 


• Small {k = 3): Three variables (Federal Funds Rate, Consumer Price Index, Gross Domestic Product growth 
rate). Core group, typically used in simple dynamic stochastic generalized equilibrium models; 

• Medium {k = 20): Small plus 17 additional variables containing aggregated economic information (e.g., 
consumption, labor, housing, exchange rates); 

• Medium-Large (k = 40): Medium plus 20 additional aggregate variables; 

• “Large:” Medium-Large plus 128 additional variables, consisting primarily of the components that make up 
the aggregated information (k=168). 


For a detailed description of each set of variables, consult Koop (2011). As Banbura et al. (2009) found that 


the greatest improvements in forecast performance occurred with the medium VAR, that will be our initial focus. 
We will attempt to forecast the medium set of indicators {k = 20) while using the additional variables from the 
medium-large category as exogenous predictors (m = 20). 

Before estimation, each series is transformed to stationarity according to the specifications provided byjStock andj 


Watson (2005) and standardized by subtracting the sample mean and dividing by the sample standard deviation. 


Quarter 2 of 1976 to Quarter 3 of 1992 is used for penalty parameter selection while Quarter 4 of 1992 to Quarter 
3 of 2007 is used for forecast evaluation. The MSFE (relative to the sample mean) over the evaluation period for 
each model is reported in Table In addition, Table records the average sparsity ratio for each VARX-L model. 
The sparsity ratio denotes the average proportion of model coefficients that are set to zero across the evaluation 
period. A sparsity ratio of one implies that all coefficients are set to zero whereas a least squares fit that includes 
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all coefficients has a sparsity ratio of zero. 


Table 2: One-step and four-step ahead MSFE of A: = 20 macroeconomic indicators (relative to sample mean) with m = 20 exogenous 
predictors p = 4, s = 4. 


Model/VARX-L Penalty Structure 

One-step ahead Out of Sample Relative MSFE 

Four-step ahead Out of Sample Relative MSFE 

Basic 

0.8064 

0.9672 

Lag Group 

0.8747 

0.9798 

Own/Other Group 

0.7773 

0.9582 

Sparse Lag Group 

0.8206 

0.9702 

Sparse Own/Other Group 

0.7823 

0.9590 

Endogenous-First 

0.8531 

0.9748 

VARX with lags selected by AIC 

5.0223 

7.8363 

VARX with lags selected by BIG 

0.9455 

1.1603 

BGR’s Bayesian VAR 

0.9414 

0.9765 

Factor Model 

0.9904 

0.9819 

Sample Mean 

1.0000 

1.0000 

Random Walk 

1.9909 

1.8706 


Table 3: Average sparsity ratio (proportion of least squares coefficients set to zero) of VARX-L models in forecasting fc = 20 

macroeconomic indicators with m = 20 exogenous predictors p = 4, s = 4, k^p + kms = 3, 200 


VARX-L Penalty Structure 

One-step ahead Average Sparsity Ratio 

Four-step ahead Average Sparsity Ratio 

Basic 

0.9090 

0.9636 

Lag Group 

0.2259 

0.6315 

Own/Other Group 

0.3049 

0.7030 

Sparse Lag Group 

0.6743 

0.7095 

Sparse Own/Other Group 

0.7118 

0.8853 

Endogenous-First 

0.1411 

0.6068 


Most of our VARX-L procedures substantially outperform the benchmarks at both forecast horizons, with the 
Own/Other Group VARX-L and Sparse Own/Other Group VARX-L achieving the best performance. This provides 
evidence that making the distinction between a series’ own lags and those of other series can improve forecasts in 
macroeconomic applications. The relatively poor performance of the Lag Group VARX-L suggests that a lag based 
grouping may be too restrictive for such a disparate group of series and hence not appropriate for this application. 

The imposition of sparsity appears to be crucial, as BGR’s Bayesian VAR performs worse than all VARX-L 
procedures at both horizons except for the Lag Group VARX-L at ft. = 4. It performs very similarly to the least 
squares VARX with lags selected by BIG at ft = 1, but slightly better at ft = 4. The factor model also performs very 
poorly at both forecast horizons, providing evidence that a low-rank structure is not appropriate for this application. 

The VARX with lags selected by AIC is substantially outperformed by the sample mean at both horizons, 
whereas the VARX with lags selected by BIG slightly outperforms it at ft = 1, but is outperformed at ft = 4. Since 
AIG imposes a weaker penalty for higher lag orders than BIG, it has a tendency to construct overparameterized 
models, whereas BIG has a tendency to underfit and misses out on potential dynamic relationships that the VARX-L 
procedures are able to capture. Since neither approach imposes variable selection, these models tend to result in 
very noisy multi-step ahead forecasts. 
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We observe that Basic VARX-L imposes the most sparsity at both horizons, which is expected as it is not bound 
to any group structure. The Sparse Own/Other VARX-L, which has a very flexible group structure as well as 
the ability to impose within-group sparsity returns the second most sparse model across both horizons. The Lag 
Group VARX-L, Own/Other Group VARX-L and the Endogenous-First VARX-L, all lacking the ability to impose 
within-group sparsity, return relatively less sparse solutions at /i = 1, but impose much more sparsity at /i = 4, 
which could reflect the decreasing predictive power of the data in forecasting longer horizons. 

As an additional application to showcase the tractability of our methods on high-dimensional problems, we 
attempt to forecast the medium-large (k=40) set of indicators using the remaining variables in the large (m=128) 
set as exogenous predictors. The forecast performance is recorded in Table and the average sparsity ratios of the 
VARX-L procedures are recorded in Table 


Table 4: One-step ahead MSFE and four-step ahead MSFE of fc = 40 macroeconomic indicators (relative to sample mean) with 
m = 128 exogenous predictors p = 4, s = 4. 


Model/VARX-L Penalty Structure 

One-step ahead Out of Sample Relative MSFE 

Four-step ahead Out of Sample Relative MSFE 

Basic 

0.7810 

0.9824 

Lag Group 

0.8550 

0.9756 

Own/Other Group 

0.7587 

0.9680 

Sparse Lag Group 

0.7899 

0.9575 

Sparse Own/Other Group 

0.7917 

0.9952 

Endogenous-F irst 

0.8278 

0.9657 

VARX with lags selected by AIC 

1.9789 

2.5105 

VARX with lags selected by BIG 

0.9999 

0.9999 

BGR’s Bayesian VAR 

0.9062 

0.9646 

Factor Model 

0.9758 

1.0166 

Sample Mean 

1.0000 

1.0000 

Random Walk 

1.6763 

1.6763 


Table 5: Average sparsity ratio (proportion of least squares coefficients set to zero) of VARX-L models in forecasting fc = 40 

macroeconomic indicators with m = 128 exogenous predictors p = 4, s = 4, k^p -f kms = 26,880 


VARX-L Penalty Structure 

One-step ahead Average Sparsity Ratio 

Four-step ahead Average Sparsity Ratio 

Basic 

0.9683 

0.9862 

Lag Group 

0.6300 

0.9584 

Own/Other Group 

0.8144 

0.9257 

Sparse Lag Group 

0.8577 

0.8711 

Sparse Own/Other Group 

0.9912 

0.9825 

Endogenous-First 

0.0702 

0.1697 


In this application, we observe that the Own/Other Group VARX-L achieves the best forecasting performance 
at h = 1, followed by the Basic VARX-L. At /i = 4, the Sparse Lag Group VARX-L achieves the best performance. 
At = 1, as in the previous application, all VARX-L procedures outperform the benchmark models. However, at 
/i = 4, BGR’s Bayesian VAR achieves the second best forecasting performance. 

Most of our VARX-L methods impose considerably more sparsity in this larger application as compared to 
Table which is to be expected as many of the included exogenous series are likely either irrelevant or redundant 
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for forecasting purposes. Strangely, the Endogenous-First VARX-L imposes less sparsity than in the previous 


application, but this could be the result of its unique nested penalty structure which forces endogenous coefficients 
to be active along with their exogenous counterparts. It is possible that the relatively poor performance of several 
VARX-L models at h = 4 is the result of not imposing enough sparsity. For example, the Sparse Own/Other Group 
VARX-L imposes less sparsity at h = 4 than h = 1. 

Canadian Macroeconomic Data Application 

We next consider a low-dimensional application in which we forecast Canadian indicators using US macroeconomic 
series as exogenous predictors. As a small, relatively open economy Canada’s macroeconomic indicators have been 


shown to be very sensitive to their US counterparts. In particular, Racette and Raynauld (1992) and Cushman 


and Zha (19971 demonstrate that the US Gross Domestic Product and Federal Funds Rate are very influential in 


modeling Canada’s analogous monetary policy proxy variables. Taking this into consideration, we forecast k = 4 
Canadian macroeconomic series using our previously defined medium dataset as exogenous predictors (to = 20). 
The endogenous series are Canadian Ml (a measure of the liquid components of money supply), Canadian Industrial 
Production, Canadian GDP (relative to 2000), and the Canada/US Exchange Rate. 

The Canadian series range from Quarter 3 of 1960 to Quarter 3 of 2007. Quarter 3 of 1977 to Quarter 2 of 
1993 is used for penalty parameter selection while Quarter 3 of 1993 to Quarter 3 of 2007 is used for forecast 
evaluation {T = 191). In addition to the standard benchmarks, we also compare against our procedures in the 
VAR-L framework, in which the exogenous predictors are ignored. Our results are summarized in Tables and 


Table 6: One-step ahead and four-step ahead MSFE (relative to sample mean) for VARX forecasts of A: = 4 Canadian macroeconomic 
indicators with m = 20 exogenous predictors p = 4, s = 4 and VAR forecasts of 4 Canadian macroeconomic indicators, p = 4. 


Model/VARX-L Penalty Structure 

One-step ahead Out of Sample RMSFE 

Four-step ahead Out of Sample RMSFE 

Basic 

0.8406 

0.9187 

Lag Group 

0.8357 

0.9285 

Own/Other Group 

0.8376 

0.9143 

Sparse Lag Group 

0.8274 

0.9129 

Sparse Own/Other Group 

0.8327 

0.9181 

Endogenous-First 

0.8454 

0.9593 

VARX with lags selected by AIC 

1.3680 

1.7739 

VARX with lags selected by BIG 

0.8785 

1.0941 

BGR’s Bayesian VAR (with exogenous series) 

1.0058 

0.9748 

Model/VAR-L Penalty Structure 

One-step ahead Out of Sample RMSFE 

Four-step ahead Out of Sample RMSFE 

Basic 

0.8465 

0.9645 

Lag Group 

0.8575 

0.9965 

Own/Other Group 

0.8491 

0.9604 

Sparse Lag Group 

0.8506 

0.9623 

Sparse Own/Other Group 

0.8493 

0.9655 

VAR with lag selected by AIC 

0.9190 

1.1365 

VAR with lag selected by BIG 

0.8785 

1.0941 

BGR’s Bayesian VAR (without exogenous series) 

1.0066 

0.9891 

Factor Model 

0.9033 

1.0152 

Sample Mean 

1.0000 

1.0000 

Random Walk 

1.3388 

1.7180 


20 




















Even in this low dimension, we find that all of our models substantially outperform the AIC and BIC benchmarks 


across both forecast horizons, with the Sparse Lag Group VARX-L achieving superior performance at both horizons. 
This low-dimensional example is better suited toward lag based groupings than our previous application. Consequently, 
the relative forecasting performance of the Lag Group VARX-L and Sparse Lag Group VARX-L improve substantially. 

In addition, we find that our methods are able to effectively leverage relevant information from the exogenous 
predictors, as every VARX-L procedure achieves better out of sample performance than its corresponding VAR-L. 
Conversely, the information criterion based VARX approaches fail to outperform their VAR counterparts. At h = 1 
and h = A, BIC produces identical forecast error in both the VAR and VARX setting, indicating that it never 
selects any exogenous series. 

BGR’s Bayesian VAR performs poorly in this scenario, achieving similar forecast performance to the sample 
mean across both horizons, both with and without exogenous series, outperforming only the Lag Group VAR-L at 
h = A. Its poor performance across both settings suggests that imposing sparsity is desirable even in low-dimensional 
applications. The factor model also performs poorly, indicating that a low-rank structure may not be appropriate 
for this small-scale macroeconomic forecasting application. 


Table 7: Average sparsity ratio (proportion of least squares coefficients set to zero) of VARX-L models in forecasting k = i 

macroeconomic indicators with m = 20 exogenous predictors p = 4, s = 4, k^p + kms = 384 and VAR-L models lacking exogenous series 
p = 4, fc^p = 64. 


Model/VARX-L Penalty Structure 

One-step ahead Average Sparsity Ratio 

Four-step Average Sparsity Ratio 

Basic 

0.7882 

0.9091 

Lag Group 

0.5953 

0.7733 

Own/Other Group 

0.5909 

0.8125 

Sparse Lag Group 

0.7282 

0.7979 

Sparse Own/Other Group 

0.6764 

0.8554 

Endogenous-First 

0.0005 

0.0236 

Model/VAR-L Penalty Structure 

One-step ahead Average Sparsity Ratio 

Four-step ahead Average Sparsity Ratio 

Basic 

0.6151 

0.7958 

Lag Group 

0.0000 

0.8403 

Own/Other Group 

0.0002 

0.6104 

Sparse Lag Group 

0.2114 

0.3114 

Sparse Own/Other Group 

0.2494 

0.7000 


We observe that less sparsity is imposed in the VAR framework as opposed to the VARX. At ft. = I, the Lag 
Group VAR-L imposes no sparsity and the Own/Other Group VAR-L imposes a negligible amount. This relative 
lack of sparsity is likely the result of the VAR-L models already operating from a reduced model space as compared 
to their VARX-L counterparts. However, under both frameworks, considerably more sparsity is imposed at ft = 4. 
Across all models, we hnd that the level of sparsity generally increases with the number of potential coefficients as 
well as across longer forecast horizons. 
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3.4 Evaluating Model Performance with Model Confidence Sets 


In addition to simply evaluating model performance based on relative MSFE, we also consider applying the model 


confidence sets framework proposed by Hansen et al. (20111 which conducts a sequence of pairwise hypothesis tests 
in order to construct a set of “superior models,” within which the null hypothesis of equal predictive ability cannot 
be rejected. 

This procedure starts by computing the sum of squared forecast errors (SSFE) for each candidate model over 
the evaluation period (T 2 + 1 through T). The SSFE for model i at time t is defined as SSEE^^t = Ijy* — yf^||p. 
After constructing the SSFE, at each time t, the MCS procedure computes the pairwise loss differential for each 
pair of models i < j: dij^t = SSFE^^t — SSFEj^*. 

Relative model performance is assessed according to the hypothesis of equal predictive ability, which tests 
whether pairwise loss averaged over time differs from zero across all model combinations. In order to evaluate this 
hypothesis, we construct the test statistic 


= 


var(dij) 


(3.2) 


in which Yiit=T 2 ) is estimated according to a block bootstrap procedure. The 

asymptotic distribution of this test statistic is non-standard, hence it is also estimated using a block bootstrap 
in a manner similar to the variance. 

The model confidence sets algorithm initializes by setting M equal to all candidate models and iteratively testing 
for equal predictive ability using the aggregate test statistic ky I- If equal predictive ability is 

rejected at a given confidence level 1 — a, the worst performing model (i.e. the model with the largest average 
pairwise loss differential) is removed from M and the procedure is repeated on the reduced subset of models. The 
algorithm terminates once equal predictive ability cannot be rejected. Eor more details on the MCS methodology, 
consult Hansen et al. (2011| and Bernard! and Catania (20141. 


We utilize the R package MCS (Bernardi and Catania, 2014) to implement this procedure. Following the package’s 


default settings, we choose a = .15 and perform 5000 bootstrap replications. Our resulting sets of equal predictive 
ability are displayed in Table 
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Table 8: Model Sets M of Equal Predictive Ability (a = 0.15). Within each set of models, we cannot reject 
the null hypothesis of equal predictive ability, though they achieve superior forecasting performance relative to all 
excluded models. 


Application 

One-Step M 

Four-Step M 



Own/Other VARX-L 
Sparse Own/Other VARX-L 

Own/Other VARX-L 
Sparse Own/Other VARX-L 


Stock and Watson 

12005 \ US Macroeconomic Data {k = 20, m = 20) 


Stock and Watson 

(2005i US Macroeconomic Data {k = 40, m = 128) 

Own/Other VARX-L 

Sparse Lag Group VARX-L 
Own/Other VARX-L 
Endogenous-First VARX-L 
BGR’s Bayesian VAR 

Canadian Macroeconomic Data (Ai = 4, m = 20) 

Sparse Own/Other VARX-L 
Sparse Lag Group VARX-L 

Sparse Own/Other VARX-L 
Basic VARX-L 

Sparse Lag Group VARX-L 
Own/Other VARX-L 


We find that the MCS procedure is able to not only distinguish between the forecasting performance of our 
VARX-L models and the benchmarks, but also within the VARX-L class of models. In the Canadian macroeconomic 
data application, no VAR-L models are included at either forecast horizon, indicating that the exogenous series 
have predictive power. We additionally find that either the Sparse Own/Other Group VARX-L or Own/Other 
Group VARX-L are in the MCS in every application. This provides further evidence supporting the use of a group 
structure that distinguishes between a series’ own lags and those of other series in macroeconomic applications. The 
only application in which a competitor’s model is included in the MCS is the large US macroeconomic application 
(fc = 40, TO = 128) at ft, = 4, though several VARX-L models are also included. The relatively poor performance of 
the VARX-L models in this scenario suggests that long-horizon forecasts in large models deserve additional scrutiny. 


4 Extending the VARX-L for Unit-Root Nonstationarity 


In some scenarios, it may not be appropriate to shrink every coefficient toward zero. In traditional time series 
analysis, economic series that exhibit persistence are transformed to stationarity. However, this framework has 
several drawbacks. First, if no pre-established transformation guidelines are available, this process can be labor 


intensive and subjective. Second, as stated by Kennedy (20031, stationarity transformations destroy information 


about the long-run relationships of economic variables. Ideally, to effectively forecast using all available information, 
it would be preferable to work directly with the untransformed series. In this section, we outline a possible 
extension that allows for shrinking toward reference models, such as a vector random walk, that can account for 
mild non-stationarity, which is ubiquitous in macroeconomic data. 
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The “Minnesota” VARX-L 

The proposed VARX-L models can easily be modified to shrink coefficients toward a known constant matrix. 
Shrinking toward constant matrices Cy G G results in a slightly modified objective of the form 




lly,-u- + A Vy(^ - Cy) + V,((3 - c,) . 


(4.1) 


in which Yt = [yj,.. • ,ytLp] and X* = [x^, -. ■ ,x7_J. 

Let [$,/3]'^(Cp, Cj:) denote a solution to this problem. Now, by a change of variables $ = $ — Cy and 
f3 = 13 — Cx , we obtain the equivalent problem 


T 


niin„ 2^ ||yt - - CyYt-i - - C,Xt_i - /3X*_i||:^ + A( iPy($) + 

v <^ .(3 t=i 


which can be expressed as 


min^ E lly* - ^ - ^^*-1 - /3Xt-i||| + a( + Vx{l3) ) , 

t=i 


in which Yt = Yt — CyTj_i — C 2 ;X 4 _i. We can view the solution to this transformed problem as [$,,3]^(0,0) 


operating on yt- Hence, transforming back to the setting of Equation (4.1), we find that 


[$,/3]\Cy,C,) = [Cy,C,] + [$,/3]^(0fcxfep,0fexr„s). 


As an example, consider Cy = [Ifc,Ofcxfe, • ■ ■ ,0/cxfc]j Cj, = Ok-Kms^ which implements a variant of the Minnesota 
prior, shrinking the VARX-L model toward a vector random walk. We refer to this extension as the “Minnesota” 
VARX-L. It could be very useful in economic applications as it is widely believed that many persistent macroeconomic 


time series can be well approximated by a random walk (Litterman 1979). 


In order to validate this alternative approach, we follow the methodology of Banbura et al. (2009), who also 


utilize the data from Stock and Watson (2005), but eschew stationarity transformations and work directly with 


the untransformed series. We again apply our VARX-L forecasting procedures by forecasting the aforementioned 
medium set of (fc = 20) series using the remaining 20 variables in the medium large set as exogenous predictors. 
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Table 9: One-step and four-step ahead MSFE (relative to a random walk) for k = 20 nonstationary macroeconomic indicators with 
m=20 exogenous predictors which shrink toward a vector random walk. 


Model/Minnesota VARX-L Penalty Structure 

One-step Ahead Out of Sample RMSFE 

Four-step Ahead Out of Sample RMSFE 

Basic 

0.8173 

0.9460 

Lag Group 

0.9450 

0.9590 

Own/Other Group 

0.8155 

0.9520 

Sparse Lag Group 

0.9858 

0.9702 

Sparse Own/Other Group 

0.8808 

0.9550 

Endogenous-F irst 

0.9746 

0.9518 

VARX with lag selected by AIC 

1.2764 

1.1896 

VARX with lag selected by BIG 

1.2764 

1.1896 

BGR’s Bayesian VAR 

1.3475 

1.0083 

Factor Model 

4.7979 

1.5968 

Sample Mean 

11.304 

5.7747 

Random Walk 

1.0000 

1.0000 


but choose not to perform any stationarity transformations and instead shrink toward a vector random walk. 

One advantage of not applying stationarity transformations is that it allows us to utilize more of our data. The 
data used in sectionextends to Quarter 4 of 2008, but one series, non-borrowed depository institutional reserves 
(FMRNBA), becomes negative in early 2008 due in part to changes in both monetary policy and Federal Reserve 


accounting (Ip 2008). The stationarity transformation guidelines provided by Stock and Watson (2005) for this 
series propose taking the first difference of logs, which is obviously not appropriate for negative values. 

Quarter 3 of 1976 to Quarter 2 of 1993 are used for penalty parameter selection while Quarter 3 of 1993 through 
Quarter 4 of 2008 are used for forecast evaluation (T=200). In this application, we also shrink BGR’s Bayesian 
VAR toward a random walk. Our results are summarized in Table and the average sparsity ratios are recorded 
in Table ITOl 


Table 10: Average sparsity ratios (proportion of least squares coefficients set to zero) of VARX-L models in forecasting k = 20 
nonstationary macroeconomic indicators with m = 20 exogenous predictors p = 4, s = 4, k^p 4- kms = 3200 


VARX-L Penalty Structure 

One-step ahead Average Sparsity Ratio 

Four-step ahead Average Sparsity Ratio 

Basic 

0.9746 

0.9652 

Lag Group 

0.7650 

0.7595 

Own/Other Group 

0.7602 

0.7533 

Sparse Lag Group 

0.8484 

0.8842 

Sparse Own/Other Group 

0.8004 

0.8099 

Endogenous-First 

0.7594 

0.7875 


We find the each of these Minnesota VARX-L procedures outperform the random walk at both forecast horizons 
with the Own/Other Group Minnesota VARX-L achieving the best out of sample performance at /i = 1 and the 
Basic VARX-L performing the best at /i = 4. 

We observe that under this scenario, the choice of structure substantially affects forecasting performance. Lag 
based groupings, such as the Lag Group, Sparse Lag Group, and Endogenous-First perform relatively poorly at 
h = 1 but slightly improve relative to other methods at ft. = 4, however they still outperform the naive methods 
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across both horizons. Their reduced relative performance is likely due to their inability to distinguish between the 
diagonal random walk component and the coefficients on other lags in the lag one coefficient matrix 

AIC and BIC are not well suited toward a nonstationarity setting, hence are completely uninformative, selecting 
lag orders of p = 1 and s = 0 at every point in time across both horizons. BGR’s procedure, despite the imposition 
of a random walk prior, produces inferior forecasts to both the VARX-L procedures and the naive random walk. 
The factor model is also poorly suited for this application and is substantially outperformed by the random walk. 

A considerable amount of sparsity is imposed across forecast horizons, regardless of structure with the Basic 
VARX-L returning the most sparse models. This indicates that a greater level of sparsity may be more appropriate 
when shrinking toward a reference model as opposed to identically toward zero. 


5 Simulation Scenarios 

In this section, we consider evaluating the forecasting performance of our procedures on several simulated 
multivariate time series conforming to different sparsity patterns, with one constructed to be advantageous for each 
proposed structure. Note that since the factor model is designed to forecast well in low-rank regimes as opposed to 
sparse regimes, due to its anticipated poor performance, we omit it from this section. 

Our objective is to quantify relative performance under both matched and unmatched model sparsity and penalty 
function structures. All simulations operate on a VARX 5 5 ( 4 ,4) of length T = 100, and each simulation is repeated 
100 times. The choice of p = s = 4 was selected because it represents one year of dependence for quarterly series, 
which is a common frequency of macroeconomic data. The middle third of the data is used for penalty parameter 
selection while the last third is used for forecast evaluation. Under the first 5 scenarios, is distributed according 
to a multivariate normal distribution with mean O 5 and covariance (0.01) x I^; the sixth scenario utilizes a more 
general specification. We do not include an intercept in any simulation scenarios. The coefficient matrix from 
each simulation scenario was designed to ensure that a stationary process would be generated. This procedure is 
elaborated upon in Nicholson et al. ( 2016| . 

In order to simulate from a VARX 5 5 ( 4 ,4), we start by constructing a VARio(4). Denoting the first 5 series as 
yt and the second 5 as Xj, we simulate according to the unidirectional relationship 


E 
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Figure 7: Sparsity Pattern Scenario 1: Unstructured Sparsity. Darker shading represents coefficients that are larger in magnitude. 



Table 11: Out of sample MSFE of one-step ahead forecasts after 100 simulations: Scenario 1. Standard errors are shown in parentheses. 


Model/VARX-L Penalty Structure 

MSFE 

MSFE Relative to Sample Mean 

Basic 

0.0645 (0.0012) 

0.0454 

Lag Group 

0.0755 (0.0010) 

0.0532 

Own/Other Group 

0.0734 (0.0010) 

0.0517 

Sparse Lag Group 

0.0724 (0.0009) 

0.0510 

Sparse Own/Other Group 

0.0699 (0.0009) 

0.0492 

Endogenous-First 

0.0779 (0.0010) 

0.0549 

VARX with lags selected by AIC 

0.1040 (0.0017) 

0.0733 

VARX with lags selected by BIG 

0.1183 (0.0032) 

0.0833 

BGR’s Bayesian VAR 

0.3675 (0.0124) 

0.2590 

Sample Mean 

1.4187 (0.0681) 

1.0000 

Random Walk 

0.8416 (0.0272) 

0.5932 


in which G denotes the dependence structure of the exogenous series X( (which follows the same sparsity 

pattern as 4>^), and u* The residual covariance is set to 0.01 x .^10 for the first 5 scenarios; the 

covariance used in Scenario 6 is depicted in Figure [T^ 

Scenario 1: Unstructured Sparsity 

We first consider a scenario in which the sparsity is completely random; our sparsity pattern was generated 
in such a manner that each coefficient was given an equal probability (10 percent) of being active, resulting in a 
coefficient matrix in which roughly 90 percent of coefficients are zero. Under such a design, we should expect superior 
performance from the Basic VARX-L, which assumes no group structure. We do not expect such a structure to 
be a common occurrence in macroeconomic applications, but it may be present in other application areas, such as 
internet traffic in which the included series can differ substantially and will likely not exhibit any group structure. 
This sparsity pattern is depicted in Figure and the results are summarized in Table [TT| 

In this scenario, as expected, we find that the Basic VARX-L achieves the best performance. Of the structured 
methods, the Sparse Own/Other VARX-L performs the best, as it can partially accommodate this sparsity pattern. 
As expected, the other approaches, which impose a structure that is not present in the data suffer from degraded 
forecasts, but all structured approaches substantially outperform the AlC and BIG benchmarks. BGR’s Bayesian 
VAR, which cannot perform variable nor lag order selection, achieves substantially worse forecast performance than 
both information criterion based methods. 
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Figure 8: Sparsity Pattern Scenario 2: Lag Group Sparsity 



Table 12: Out of sample MSFE of one-step ahead forecasts after 100 simulations: Scenario 2. Standard errors are shown in parentheses. 


Model/VARX-L Penalty Structure 

MSFE 

MSFE Relative to Sample Mean 

Basic 

0.0786 (0.0012) 

0.1397 

Lag Group 

0.0709 (0.0011) 

0.1260 

Own/Other Group 

0.0713 (0.0011) 

0.1268 

Sparse Lag Group 

0.0739 (0.0012) 

0.1314 

Sparse Own/Other Group 

0.0742 (0.0011) 

0.1319 

Endogenous-First 

0.0720 (0.0011) 

0.1280 

VARX with lags selected by AIC 

1.0084 (0.0273) 

1.7933 

VARX with lags selected by BIG 

0.9927 (0.0282) 

1.7654 

BGR’s Bayesian VAR 

0.5769 (0.0146) 

1.0259 

Sample Mean 

0.5623 (0.0123) 

1.0000 

Random Walk 

1.1279 (0.0322) 

2.0058 


Scenario 2: Lag Group Sparsity 

We next consider a scenario in which and are dense with coefficients of the same magnitude, and 
all other coefficients are set to zero. Such a sparsity pattern may be present in disaggregated macroeconomic 
series, such as agricultural price indices which follow a purely seasonal autoregressive relationship and exhibit a 
substantial degree of cross-dependence. Under such a design, we should expect superior performance from the Lag 
Group VARX-L, which partitions all coefficients within a lag to the same group. This sparsity pattern is depicted 
in Figure]^ and the results are summarized in Table 

As expected, we find that the Lag Group VARX-L achieves the best performance and all structured approaches 
outperform the Basic VARX-L. Under this scenario, all VARX-L procedures offer a substantial improvement over 
the benchmarks. This is likely a result of their ability to effectively leverage the strong signal from the exogenous 
predictors. Note that although the AIC and BIG benchmarks utilize this exogenous information, they are restricted 
to select from models of sequentially increasing lag order, hence they cannot accommodate this sparsity pattern 
and likely overfit, resulting in comparable performance to a random walk. BGR’s Bayesian VAR improves upon the 
information criterion based benchmarks, but since it cannot perform variable selection, it performs substantially 
worse than all VARX-L methods. 
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Figure 9: Sparsity Pattern Scenario 3: Structured Lagwise, Unstructured within Lag 




■--V 





















i 
















■ 



■ 
































■ 

■ 









































1 



















■ 


















a 



1 













LB 





















od) 0(2) o^^) 


Table 13: Out of sample MSFE of one-step ahead forecasts after 100 simulations: Scenario 3. Standard errors are shown in 

parentheses. 


Model/VARX-L Penalty Structure 

MSFE 

MSFE Relative to Sample Mean 

Basic 

0.0665 (0.0008) 

0.1258 

Lag Group 

0.0696 (0.0008) 

0.1317 

Own/Other Group 

0.0699 (0.0009) 

0.1322 

Sparse Lag Group 

0.0677 (0.0008) 

0.1281 

Sparse Own/Other Group 

0.0683 (0.0008) 

0.1293 

Endogenous-First 

0.0711 (0.0009) 

0.1345 

VARX with lags selected by AIC 

0.1300 (0.0019) 

0.2458 

VARX with lags selected by BIG 

0.2501 (0.0061) 

0.4730 

BGR’s Bayesian VAR 

0.7568 (0.0515) 

1.4314 

Sample Mean 

0.5287 (0.0275) 

1.0000 

Random Walk 

1.3000 (0.0731) 

2.4588 


Scenario 3: Structured Lagwise Sparsity, Unstructured Within-Lag 

Our third scenario can be thought of as a hybrid of Scenarios 1 and 2. As in Scenario 2, certain coefficient matrices 
are set identically to zero; only matrices $(^),/3(^), and /^(^^ contain nonzero coefficients. Additionally, 

in a similar manner to Scenario 1, sparsity within each lag is unstructured. This scenario can be viewed as 
a less restrictive and more realistic version of the structure presented in Scenario 2 as it allows the degree of 
cross-dependence to vary across components. In such a scenario, we should expect procedures that allow for 
within-group sparsity, such as the Sparse Lag Group VARX-L and Basic VARX-L to achieve the best forecast 
performance. This sparsity pattern is depicted in Figure]^ and the results are summarized in Table 

Under this scenario, the Basic VARX-L achieves the best performance, followed closely by the Sparse Lag Group 
VARX-L. Unlike Scenario 2, since this structure exhibits dependence in the first lag, the information-criterion based 
benchmarks are able to capture a portion of the true underlying structure in both the endogenous and exogenous 
series and thus substantially outperform the naive benchmarks. However, since they cannot account for within-lag 
sparsity, they are still considerably outperformed by all VARX-L methods. As in Scenario 1, BGR’s Bayesian VAR 
performs very poorly, since it cannot perform variable or lag order selection. 

Scenario 4: Sparse and Diagonally Dominant 

Our final scenario consists of a diagonally-dominant sparsity structure, in which all diagonal elements in $(^) 
and are equal in magnitude, whereas all off-diagonal endogenous coefficients are set to zero. As in scenario 
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Figure 10: Sparsity Pattern Scenario 4: Sparse and Diagonally Dominant 



Table 14: Out of sample MSFE of one-step ahead forecasts after 100 simulations: Scenario 4. Standard errors are shown in parentheses. 


Model/VARX-L Penalty Structure 

MSFE 

MSFE Relative to Sample Mean 

Basic 

0.0669 (0.0008) 

0.0406 

Lag Group 

0.0720 (0.0008) 

0.0437 

Own/Other Group 

0.0626 (0.0008) 

0.0380 

Sparse Lag Group 

0.0729 (0.0011) 

0.0442 

Sparse Own/Other Group 

0.0625 (0.0008) 

0.0379 

Endogenous-First 

0.0725 (0.0011) 

0.0440 

VARX with lags selected by AIG 

0.1043 (0.0015) 

0.0633 

VARX with lags selected by BIG 

0.1044 (0.0015) 

0.0634 

BGR’s Bayesian VAR 

0.7741 (0.0394) 

0.4702 

Sample Mean 

1.6460 (0.0902) 

1.0000 

Random Walk 

0.7512 (0.0390) 

0.4563 


2, the coefficients in and are identical in magnitude. This structure incorporates the belief posited by 


j(4) 


Litterman (1986a I that macroeconomic series’ own lags are more informative in forecasting applications than lags 


of other series. Under this setting, one would expect superior performance from the Own/Other Group VARX-L. 
The sparsity pattern is depicted in Figure [l0| and the simulation results are summarized in Table 14 


Under Scenario 4, as expected, the Own/Other Group and Sparse Own/Other Group VARX-L achieve superior 
forecasts. Since the magnitude of coefficients within a lag matrix varies substantially, structures that utilize lag-based 
groupings, such as the Lag Group and Endogenous-First VARX-L are unable to capture this discrepancy and thus 
perform relatively poorly. However, they still substantially outperform the benchmark procedures. We again find 
that VARX with lags selected by AIG and BIG perform very poorly, as they are restricted to select from sequentially 
increasing lag orders and cannot account for within-lag sparsity. BGR’s Bayesian VAR also performs poorly for 
similar reasons. 

Scenario 5: No within-lag Sparsity 

Next, we consider a scenario in which the entire 5 x 40 coefficient matrix is dense. Although the coefficient matrix 
is not sparse, as in the other scenarios, the information criterion based methods can impose sparsity by truncating 
the maximum lag orders p and s. The relative magnitudes of the coefficient matrix are depicted in Figure p4^ This 
scenario serves to give an advantage to the procedures that do not impose any within-lag sparsity: the information 


criterion based methods as well as BGR’s Bayesian VAR. The results from this scenario are depicted in Table 15 


We observe that despite a lack of sparsity in the data generating process, all VARX-L methods substantially 
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Table 15: Out of sample MSFE of one-step ahead forecasts after 100 simulations: Scenario 5. Standard errors are shown in parentheses. 


Model/VARX-L Penalty Structure 

MSFE 

MSFE Relative to Sample Mean 

Basic 

0.0780 (0.0010) 

0.6142 

Lag Group 

0.0731 (0.0008) 

0.5724 

Own/Other Group 

0.0736 (0.0009) 

0.5761 

Sparse Lag Group 

0.0747 (0.0009) 

0.5848 

Sparse Own/Other Group 

0.0747 (0.0009) 

0.5849 

Endogenous-First 

0.0736 (0.0009) 

0.5766 

VARX with lags selected by AIC 

0.1207 (0.0016) 

0.9444 

VARX with lags selected by BIG 

0.1157 (0.0022) 

0.9053 

BGR’s Bayesian VAR 

0.1148 (0.0018) 

0.8982 

Sample Mean 

0.1278 (0.00296) 

1.0000 

Random Walk 

0.2020 (0.00369) 

1.5806 



outperform the information criterion benchmarks as well as BGR’s Bayesian VAR. The Lag Group VARX-L achieves 
the best performance, which is to be expected. Since the Lag Group VARX-L has fewer groups than the Own/Other 
structures and does not impose within-group sparsity, it has more of a tendency to employ ridge-like penalization 
as opposed to setting an entire group to zero. Within the class of VARX-L models, the Basic VARX-L performs the 
worst. This suggests that structured groupings are more robust in applications where the true model is not sparse. 
Scenario 6: Non-diagonal Covariance 

The previous five simulation scenarios impose a diagonal structure on S„, the contemporaneous covariance 
matrix. Such a scenario may rarely occur in practice. In order to examine the robustness of the VARX-L procedures 
in the presence of a non scaled-identity covariance matrix, this scenario pairs the sparsity pattern from Scenario 1 
with the covariance structure depicted in Figure [l^ 

Note that unlike the least squares VARX with lags selected by AIC or BIG, as well as BGR’s Bayesian VAR, 
the VARX-L procedures do not explicitly incorporate Tiu- Hence, one should expect the benchmark procedures to 
be at a slight advantage in this scenario. The results from this scenario are summarized in Table |16| 


In this scenario (see Table 16) we observe that the Basic VARX-L, which conforms to the true sparsity pattern 
achieves the best performance, following by the Sparse Own/Other Group VARX-L despite the more complex 
error structure. As in every other scenario, all VARX-L models substantially outperform the benchmarks. BGR’s 
Bayesian VAR performs substantially worse than both information criterion based methods, even though it explicitly 
incorporates the covariance of Uj; its poor performance is likely the result of incorporating an unreliable covariance 
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Table 16: Out of sample MSFE of one-step ahead forecasts after 100 simulations: Scenario 6. Standard errors are shown in parentheses. 


Model/VARX-L Penalty Structure 

MSFE 

MSFE Relative to Sample Mean 

Basic 

0.7982 (0.0113) 

0.1034 

Lag Group 

0.9089 (0.0133) 

0.1177 

Own/Other Group 

0.8926 (0.0134) 

0.1156 

Sparse Lag Group 

0.8833 (0.0122) 

0.1144 

Sparse Own/Other Group 

0.8746 (0.0127) 

0.1133 

Endogenous-First 

0.9283 (0.0135) 

0.1202 

VARX with lags selected by AIC 

1.4237 (0.0232) 

0.1844 

VARX with lags selected by BIG 

1.0895 (0.0185) 

0.1411 

BGR’s Bayesian VAR 

2.8876 (0.0782) 

0.4022 

Sample Mean 

7.1789 (0.3739) 

1.0000 

Random Walk 

4.2543 (0.1465) 

0.5511 


Figure 12: 


Covariance Matrix: Simulation Scenario 6 




estimate. For an expanded discussion and simulation study involving non-diagonal covariance structures as well as 


a procedure to incorporate covariance in fitting VARX-L models, consult Nicholson et al. (2016). 

Overall, all of the proposed VARX-L models are fairly robust to sparsity patterns not conforming to their true 
group structures. In each scenario, every method substantially outperforms all benchmark procedures. Scenario 1 
is the only case in which the structured approaches perform poorly relative to the Basic VARX-L. We expect such 
an unstructured sparsity pattern to occur only rarely in macroeconomic applications. 


6 Conclusion 

We have shown that the proposed VARX-L structured regularization framework is very amenable to the VARX 
setting in that it can simultaneously reduce its parameter space and still incorporate useful information from both 
endogenous and exogenous predictors. VARX-L models scale well with the dimension of the data and are quite 
flexible in accommodating a wide variety of potential dynamic structures. Each of the proposed methods consistently 
outperforms benchmark procedures both in simulations and in macroeconomic forecasting applications. Forecast 
performance of all models appears to be robust across multiple sparsity structures as well as forecast horizons. 
Moreover, upon examining actual macroeconomic data, structured VARX-L models tend to outperform the Basic 
VARX-L. 
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Our work has considerable room for extensions. This paper focuses solely on forecasting applications, but our 


VARX-L framework could also be extended to structural analysis and policy evaluation using an approach similar 


to that of Furman (20141. In addition, our current implementation requires a coherent maximal lag selection 
mechanism. The common procedure of choosing a lag order based on the frequency of the data is problematic in 
that it can lead to overfitting. One could potentially incorporate an additional penalty parameter that grows as the 


lag order increases, as in Song and Bickel (2011), but this approach requires a multi-dimensional penalty parameter 
selection procedure and subjective specification of a functional form for the lag penalty. 

An R package containing our algorithms and validation procedures, BigVAR, is available on the Comprehensive 
R Archive Network (CRAN). 
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A Appendix 

A.l Compact Matrix Notation 

In deriving the solution methods for our algorithms, we find it convenient to express the VARX using compact 
matrix notation 


Y = [yi, ■ • ■ ,yT] 


(fcxT); 


1 = [1,...,1]T (Txl). 



B = [$,/3] 


[k y. {kp + ms)]] U = [ui, ..., ut] {k x T) 


Equation (2.1) then becomes 


Y = ul^ +BZ + IJ 


and the least squares procedure (2.2) can be expressed as minimizing ^\\Y — over ix and B. 


A.2 Intercept Term 


In regularization problems, the intercept £> is not typically regularized and can instead be derived separately. 


Using compact matrix notation, we can express the unpenalized portion of (2.3) as 


f{B,jx) = ^\\Y-,xl-^ -BZ\\% 


(A.l) 


We can find i> by calculating the gradient of (A.l) with respect to u 


0 = Vuf{B, ix) = iY- i>lT - BZ)1, 


i'kW =Yk- — BZk 
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in which Yk- 


(A.l) as 


kti and Zk- = This provides some insight into the scaling, as we can rewrite 


mm-\\Y - {Y - BZ)1^ - BZ\\%, 


■.mm^\\{Y-Yl^)-B{Z-Zl 


Tm|2 


(A.2) 


in which K is a fc x 1 vector of row means and Z is a {kp + ms) x 1 vector of row means. 

A.3 Solution Strategies 


In the following sections, assume that Y and Z are centered as in Equation (A.2). 

A.3.1 Basic VARX-L 

Utilizing the coordinate descent framework, we can find B via scalar updates. To generalize to a multivariate 
context, we can express the one-variable update for the (j, r) entry of B, Bjr as 


min - jt — BjiZ^t — Bj^Zjt)'^ + \\Bjr\. 


Let Rf = Y jt — 'Yhi^r BjtZa denote the partial residual. Then, we can rewrite Equation (A.3) as 


(A.3) 


9 jr{B) — min 'y BjrZjt) + X\Bjr\ 

Bjr t 

— 9 ~ Bjr^jt ~ ‘^^tZjtBjr) -b X\Bjr\. 

Bjr ( 


Now, differentiating with respect to Bjr gives the subgradient as 


dgjr{B) 9 Bjr Z'^jt — 'RtZjt + Xijj{Bjr), 
t t 


where we define ip {Bjr) as 

I {sgn(Bjr)} Bjr 7^ 0 

Ip e < 

[- 1 , 1 ] B,r=Q. 
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For Bjr to be a global minimum, 0 G dg{Bjr). After some algebra, the optimal update can be expressed as 


B jj- ^— 


ST{Y.t^tZ,u\) 


Where ST represents the soft-threshold operator 

ST{x,(j)) = sgn(a:)(|x| - </>)+, 

sgn denotes the signum function, and (|a;| — (/>)+ = max(|a;| — (^,0). The Basic VARX-L procedure is detailed in 
Algorithm 

A.3.2 Lag Group VARX-L 

Rather than vectorizing the Lag Group VARX-L and solving the corresponding univariate least squares problem, 
if the groups are proper submatrices we can exploit the matrix structure for considerable computational gains. 
Without loss of generality, we will consider the “one lag” problem for (the problem for is analogous). 


2 




F, 


(A.4) 


in which, for notational ease, we directly incorporate the weighting into the penalty parameter by defining A = k\, 
Rg = — Yg again represents the partial residual. Taking the gradient of ||R-q — with 

respect to results in 

= V*(„ Tr f(R_, - , 


2 ' 


In which Tr denotes the trace operator. The subgradient of (A.4) with respect to is then 
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where uj is defined as 




ill 


;($(9)) = 


$(?) ^ 0 


{U : \\U\\f < 1} 


(g) 

Consider the case where $ =0. Then 


-^ e {U : \\U\\f < 1}, 


-9-^9 




We can conclude that = 0 


IIp’ < A. Now, assuming ^ 0, we have that 


^^“^Z.Zj -R^,Zj + X{ 


\ii) 


!$(9)| 


-) = o, 


^^'^^ZgZj +X{ 


f.(9) 


-)=R-gZj, 


f Z,Z^ + 1 = R-aZj 


|$( 9 )| 


- 9-^9 ■ 


Now, since ZqZ^ is positive semidefinite and A > 0, we can infer that ZqZ^ + 


T 

9-^9 


11 $ 


(9) I 


hence it is possible to create a trust region subproblem that coincides with Equation (A.4). 


transform R^qZ^ s into a vector. Define 


Tq = Vec{R-qZq), 

Gq = ZqZ^ ® If;, 

4>q = vec($^'^^). 


in which 0 denotes the Kronecker product. Hence, we can rewrite Equation (A.51 as 


< (G, 


110 . 


q\\F 


-/fc 2 ) = rq. 


(A.5) 

fc is positive definite. 
However, we need to 
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Applying the same transformation to the original subproblem, we can express Equation |A.4| as the trust region 
subproblem 


min ^(j)lGqGqCf>q +rgCf)q, 

S.t.|l</>,||F < A, 


in which A > 0 denotes to the trust-region radius which corresponds to the optimal solution of Equation | A. 4| These 
modifications allow for the use of the block coordinate descent algorithm described in Qin et ah] ( |2010[ ). Expanding 
upon their arguments, by the Karush-Kuhn-Tucker (KKT) conditions, we must have that: A(A — ||^*||f) = 0, 


which implies that ||<Ag||F = A. Then, applying Theorem 4.1 of Nocedal and Wright (1999), we can conclude that 


</.! = - G„ + -7fe2 


(A. 6 ) 


Qin et al. (2010) remarks that Equation (A. 6 ) can also be expressed as 4>* = Ayq{A), where 


yg{A) = - {AGq + XIk2) V,, 


(A.7) 


Note that, based on the KKT conditions, ||j/q(A)||F = 1- Hence, the optimal A can be chosen to satisfy ||2/q(A)||F = 


1. We can efficiently compute ||j/q(A)|||, via an eigen-decomposition of Gg. We start by rewriting Equation (A.7) 


as 


yg{A) = - {AWVW^ + Xh2) \g, 

= -WiAV + Xh2)-^W^rg, 

in which the first line follows from the spectral decomposition of a symmetric positive semidefinite matrix. Einally, 
we can express ||?/ 5 (A)|||, as 


|| 2 /,(A)|i^ = ^ 


(v,A + A) 2 ’ 
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in which Wi denotes the columns of W and Vi the diagonal elements of V. 
determine the optimal A by applying Newton’s method to find the root of 


Qin et al. (20101 notes that we can 


fl(A) = 1 - 


II%(A)||f- 


(A.8) 


The full Lag Group VARX-L procedure is detailed in Algorithm Our algorithm organizes iterations around an 


active-set” as described in Friedman et al. (2010). This approach starts by cycling through every group and then 


only iterating on the subset of B that are nonzero (the “active-set”) until convergence. If a full pass through all 
B does not change the active set, the algorithm has converged, otherwise the process is repeated. This approach 
considerably reduces computation time, especially for large values of A in which most model coefficients are zero. 

A.3.3 Own/Other Group VARX-L 

In the Own/Other setting since the groups are not proper submatrices, in order to properly partition each 


into separate groups for own and other lags. Equation (2.3) must be transformed into a least squares problem. To 
perform a least squares transformation, we define the following 


T—qq — Vec(R_gg), 
Cjjqq = Vec($^«^), 

Mqq = {Z^ <»Ik)qq. 


Then, the one block subproblem for own lags (group qq) can be expressed as 


min \\MqqCf)qq + T-qqWp + A||^qg|ji7', 

*Pqq " 

~ 4’qq^qq^qq4’qq T '^—qq^qq4‘qq T -^11 II F, 

4>qq ^ 

~ O^qq^qq^’ll^’qq T '^—qq^qq4^qq T '^ll 099 II • 

4>qq ^ 

At ^qq, we must have that 0 G df{cf)qq). The subgradient can be expressed as 


d 

d4>qq 


^qq^qq'pqq T ^qq^qq A ^^{4*qq)^ 
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where uj is defined as 


;(s) e 


i llsllc. } 


s ^ 0 


{u : ||m||f <1} s = 0. 

Thus, after applying these transformations, we can apply a slightly adapted version of Algorithm 

A.3.4 Sparse Lag Group VARX-L 

As with the Lag Group VARX-L, we will consider the one-block subproblem for lag 


mm -||R„, - + (1 - a)A|l$(^)|jF + 

^("j) ZK 


(A.9) 


Since the inclusion of within-group sparsity does not allow for separability, coordinate descent based procedures 


are no longer appropriate, therefore, following Simon et al. (20131 our solution to the Sparse Lag Group VARX-L 


utilizes gradient descent based methods. We express Equation (A.9) as the sum of a generic differentiable function 
with a Lipschitz gradient and a non-differentiable function. 

We start by linearizing the quadratic approximation of the unpenalized loss function that only makes use of 


first-order information around its current estimate (borrowing from Simon et al. (2013), for notational ease, 


let $ = represent the unpenalized loss function, and 7^($) represent the penalty term). Then, we can 

express the linearization as 


M($, $o) = ^(^o) + vec($ - $o)^vec(V^($o)) + " ^o||| + Vi^) 

= 4||R-, - + (^ - ^0, - R^g)Z^) + - $o||| + ^(^), 


2k' 


2d' 


in which d represents the step size. Removing terms independent of our objective function becomes 


argminM($, 4>o), 

$ 

= argmin;^||$ - (*o - 
^ 2d \ 


T \ ||2 


9 mif -c Vi^). 


Then, generalizing the arguments outlined by Simon et al. (2013), we can infer that the optimal update 17(4*) can 
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be expressed as 


[/($)= 1 - 


d(l — a)A 


\\ST{^o - di^oZg - Il_g)Z^,daX)\\F^ 


ST{^o - di^oZg - R-,)Zl„, daX) 


As in Simon et al. (20131, we apply a Nesterov accelerated update. At step j, we update according to 


4[j] ^ ^[j - 1] + - 1]), (A.IO) 

J +3 


which, per 


gradient descent. 

We use a constant step size according to the Lipschitz constant, H, which must satisfy 


Beck and Teboulle (2009) converges at rate l/p as opposed to the 1/j rate of the standard proximal 


WVxiiX) -^re{Y)\\ < H\\X -Y\\. 


Consider two submatrices and We have that 


Vc(,)£(C(«)) = - R_,Z^, 

Va(,)^(A( 9)) - Vc(,)^(C(«)) = (A(«) - C(«))Z,ZT, 
||(A(«) _ C(^))Z,Z^II2 < ||A(«) - C(«)||2||Z,Z^ 112. 


The last inequality follows from the sub-multiplicity of the matrix 2-norm. Therefore, we can conclude that the 
Lipschitz constant is \ \ZqZ ^\\2 = (Ti(ZgZ^), i.e. the largest singular value of ZqZ^, which has dimension k x k 
for ..., and is a scalar for exogenous groups. Since ZqZj is symmetric and positive semidefinite, it is 
diagonalizable, and the maximum eigenvalue can be efficiently computed using the power method, described in 


Golub and Van Loan (2012). 


As only the maximum eigenvalue is required, the power method is much more computationally efficient than 
a computation of the entire eigensystem. Moreover, we retain the corresponding eigenvector produced by this 
procedure to use as a “warm start” that substantially decreases the amount of time required to compute the 
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maximal eigenvalue at each point in time during the cross-validation and forecast evaluation procedures. 

In a manner similar to Algorithm an “active-set” approach is used to minimize computation time. The inner 
loop of of the Sparse Group VARX-L procedure is detailed in Algorithm An outline of the algorithm is below: 

1. Iterate through all groups. For each group: 

(a) Check if the group is active via the condition: Zq — R_q )Z^||^<(l-a)A. 

(b) If active, go to the inner loop (Algorithm]^, if not active, set group identically to zero. 

(c) Repeat until convergence. 


Upon performing the least squares transformations as in the Own/Other Group VARX-L (section A.3.31, the 
Sparse Own/Other Group VARX-L follows almost the exact same procedure as its lag counterpart. 

A.3.5 Endogenous-First VARX-L 

The Endogenous-First VARX-L is of the form 


min - IIV 



Since the optimization problem decouples across rows, we will consider solving the one row subproblem (for row i) 


min 

2 " 


e=i ^ 



+ m 


w 



(A.ll) 


In a manner similar to the Sparse Group VARX-L, the Endogenous-First VARX-L is solved via proximal gradient 
descent. For ease of notation, let 7^($,/3) represent the nested penalty. The update step for the Endogenous-First 
VARX-L (at step j) can be expressed as 


B,[j\ = Prox^^_^(^ - 1] - (rsJ(.{B,[j - 1])), 


(A.12) 


in which d denotes step size and £{Bi[j — 1]) denotes the unpenalized loss function. Note that V£{Bi) = —(Vj — 


BiZ)Z^. Similar to the Sparse Group VARX-L setting, a fixed step size is used; d = 


dZZ^) 


. To speed 


convergence, as in the Sparse Group VARX-L update step (A. 101, we apply a similar Nesterov-style accelerated 
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update: 


B, ^ B,[j - 1] + - 1] - Hj - 2]), 


Thus, (A. 12) becomes 


B,[j] = - dVe{B,)), 


(A.13) 


Jenatton et al. (2011) observed that the dual of (A.13) can be solved with one pass of block coordinate descent. 


Moreover, the block updates are extremely simple and available in closed-form. Algorithm details the prox 
function for the Endogenous-First VARX-L. Note that it consists of p separate nested structures for each series. 


Thus, solving (A.13) essentially amounts to calling the same proximal function p times at each update step. 


A.4 


Banbura et al. 


(2009) Implementation 


The Bayesian VAR proposed by [Banbura et al.| ( |2009| ) utilizes a normal inverted Wishart Prior. Defining 
(j) = vec(€>), the prior has the form 


(j)\V, ^ O 0 Oq) 
Q iW{So,ao), 


in which iW denotes the inverse Wishart distribution. This prior is implemented by adding the following dummy 
observations to Y and (which we define as X): 


Y 


di 


^diag((5(Ti,... ,(5crfe)/A^ 


^kx{p—l)xk 
diag(cri, ...,ak) 
Oixfc 


V 


/ 


^Ofepxi Jp0diag(cri,...,crfe)^ 


Xd, = 


Ofc 


xl 


^kxkp 

Olxfcp 
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The scale parameters for the prior variances of each series, ai,... ,ak, are estimated by univariate autoregressive 


models. Jp = diag(l, 2 ,... ,p), and e denotes the prior on the intercept and is set to a very small number (e.g. 
le-5). 6 serves as an indicator for the prior belief that the series have high persistence. We set 5 = 0 in all of our 
forecasting applications except for the Minnesota VARX-L application in section 

In addition to the above construction, following Doan et al.| ( |1984 ), BGR adds a prior that imposes a bound on 
the sum of coefficients by shrinking 11 = (J^ — — ... — toward zero. This prior is implemented by adding 

the additional dummy observations 


Yd^= diag((5^i,.. .,6fik)/T, 

^d 2 = ^Ofcxi lixp ® diag((5^i,... ,(5^fc)/T^ > 

in which pi,..., are meant to capture the average level of each of the series, set according to their unconditional 
means and r denotes a loose prior which is set to lOA. After appending the dummy observations to Y and X and 
creating the augmented matrices W* and X the posterior mean can be calculated in closed form as: 

= (xJx^y^xjY^ 


A.5 Penalty Grid Selection 


Table 17: Starting values of the penalty grid for each procedure; pq denotes the number of variables in group q. 


Structure 

Starting Value of A^rid 

Basic 


Lag Group 

maXqdjZq'K^llj?) 

Sparse Lag 

maxq {\\ZqY^\\p)a 

Own/Other Group 

maxq ( |(Z^ ® Ik)qvec{Y)\\F/^^) 

Sparse Own/Other Group 

maxq (||(Z^ ® Ik)qvec(Y)\\F/^^)a 

Endogenous-First 

muxiiWZYj IIj?), 


A.6 Algorithms 
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Algorithm 1 Basic-VARX-Lfe „j(p, s) 

Require: 

gOLD ^ gINI 

repeat 

for 2 = 1,. .., /c, j = 1,. .., /cp + ms do 

R ^ 

r^. dNEW . ST(X;t,A) 

-“y ■* y', Z2 

^jt 

end for 

^OLD ^ ^NEW 

until Desired threshold is reached 
ly - ^NEW2 

10: return 


Algorithm 2 Basic VARX-L(p,s) Cross-Validation 

Require: 1^, Agrid A 

^LAST ^ ^INI 

for t in [Ti, T 2 — /i] do 

\/’(^) /_ \r 

^ TRAIN ^ ^ ,h:(t-l) 

-^TRAIN 

for i in AQ^id do 

i.,, RNEW ^ Basic-VARX-L(yW^jN, ain. 

-^TEST ^ -^,(^+1) 

SSFE(^A ^ \\Yt+h - ^ ^■i_^z%s^]\\l 

]^Q. ^LAST ^ ^NEW 

end for 

for i in AQ^id do 

MSFE^') ^ T^-rl-h+i Tt SSFE^^A 

15: end for 

end for 

return A^, where i = argmin^ MSFE^'^^ 
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Algorithm 3 Lag Group VARX-Lj, ,„(p, s) with active-set strategy 

Require: Bjni, 5, V, Z, Aini, A 
Define: 


for g = 1,..., p -|- ms : 

Gg = Mg (g)Ifc. 


Sao,ini <— Binii 
Aao,ini <— Aini, 
for A e Agrid do 
5: repeat 


■<—ThresholdUpdate(AA, i A) 

■BA,ylFULL’ aIa •i-BlockUpdate(AFULL,-BA,^^, A) 

untilBA.^A =:Ba,^full 

i> Y — Bx j\^Z 

10: end for 

return i>, Sa, 

procedure BlockUpdate(^, Bini, A) 

B ^ Bini 
for g £ G do 

15: R B-gZ-g - Y 

r^RZT 
if l|r||F < A then 


20 : 


25: 


30: 


B* <- 0 
Ag <- 0 


Isl 


end if 

if llrllF > A then 

A <— the root of f2(A) defined in | |A.8| | 

vec(Bg)^-{Gg-h Ai)-ir 

Ag •«— 3 

end if 
end for 
return Bx, A 
end procedure 

procedure ThresholdUpdate(^a, A) 

if Aa = 0 then return Okxkp+ms 
end if 

if Aa i=- 0 then 

-Ba,old <— -Ba.ini 

repeat 


35: -Ba,neWi-Aa <— BlockUpdate(AA,-B a.olDi A) 

■Ba.old <— Ba,new 
until Desired threshold is reached 

end if 

return Ba,neWi-A 
40: end procedure 


\> Makes one full pass through all groups 


> Iterates through active set until convergence 


Algorithm 4 Sparse Lag Group VARX-L inner loop 


Require: 4>oi -^gi , a 

h ' 1 


#0 
repeat 

i 

■y] 


- 


(#"Z,-R_,)Zj 

- k 


vec( 70 +l)) •!- ( 1 - 


h,(l —ct) A 


10 : 


\\ST(^i Z ,ha\)\\p 

^J + 1 yj + i — 7-’) 

i ^ i + 1 


until Desired threshold is reached 


5T(vec(^^) - hvec{Fq),haX) 


50 











Algorithm 5 Endogenous-First VARX-L Proximal Problem 
Require: v,X,k,p,m,s 

for i=l,... ,p do 

91 ^ [((i -1) ■ k + l) : ■ k + k)] 

92 ^ [((*- 1 ) ■ m + p ■ k) : (^{i — 1) ■ m + p • k + to)] 

^{si,S2} ^ Pi'Ox(w{gi,g,},A,fc) 

end for 
return v. 

procedure PROx(ti,A,k) 
h2 ^ {k + 1) : [k + to) 
hi ^ 1 : (k + to) 

for j = 1,2 do 

Vh^ ^ (1 - \/\\Vh^\\F) + Vh, 

end for 
return v. 
end procedure 
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